-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
ENH: Add I/O support of XML with pandas.read_xml and DataFrame.to_xml… #39516
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
pandas/io/formats/xml.py
Outdated
|
||
try: | ||
if self.io: | ||
with open(self.io, "wb") as f: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if you want to support compression/fsspec/buffers and so on:
with get_handle(self.io, mode="wb", is_text=False, storage_options=..., compression=...) as handles:
handles.handle.write(xml_doc)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am not strong on these features. I attempted compression in a draft version of read_xml
looking at _json.py
but implementation would require workarounds. Can this be for future PR? If not, I would need several days to incorporate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
totally fine to defer
just add a check list of todos (top of PR is ok or new issue)
pandas/io/xml.py
Outdated
as string, depending on object type. | ||
""" | ||
|
||
obj = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could boil down to the following (except for str/bytes)
with get_handle(self.io, mode="r", is_text=True, encoding=self.encoding) as handles:
obj = handles.handle.read()
Accepting strings of XML and strings representing filepaths might get tricky in some cases?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good. a number of comments on the structure. will give more specific comments on the code itself after reorg. the tests look ok so far. need a bunch more on error testing. we want virtually every error line tested (meaning anytime you are raising an excpetion each tyep should be tested).
its ok to add a checklist to the issue to track things (some of which can also be in followup PRs)
doc/source/whatsnew/v1.3.0.rst
Outdated
.. _whatsnew_130.read_to_xml: | ||
|
||
We added I/O support to read and render shallow versions of XML documents with | ||
:func:`pandas.read_xml` and :meth:`DataFrame.to_xml`. Using lxml as parser, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add a reference to lxml (same one as we have in install.rst)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
doc/source/whatsnew/v1.3.0.rst
Outdated
</row> | ||
</data>""" | ||
|
||
df = pd.read_xml(xml) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you need to show the rendered df, so end the ipython block here, and then add another one for the .to_xml()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
pandas/core/frame.py
Outdated
@@ -2604,6 +2604,178 @@ def to_html( | |||
render_links=render_links, | |||
) | |||
|
|||
def to_xml( | |||
self, | |||
io: Optional[FilePathOrBuffer[str]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
name this path_or_buffer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/io/formats/format.py
Outdated
@@ -1003,6 +1005,121 @@ def to_html( | |||
string = html_formatter.to_string() | |||
return save_to_buffer(string, buf=buf, encoding=encoding) | |||
|
|||
def to_xml( | |||
self, | |||
io: Optional[FilePathOrBuffer[str]] = None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
pandas/io/formats/xml.py
Outdated
if isinstance(self.stylesheet, str): | ||
obj = self.stylesheet | ||
|
||
if isinstance(self.stylesheet, bytes): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if/elif
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but see @twoertwein comments
fallback option with etree parser. | ||
""" | ||
|
||
if parser == "lxml": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like this is repeated logic here. need to handle this centrally.
pandas/io/xml.py
Outdated
return _data_to_frame(data=data_dicts, **kwargs) | ||
|
||
|
||
@deprecate_nonkeyword_arguments(version="2.0") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you don't need to do this, just make everything keyword only (except for io) but rename that to path_or_buf
</data>""" | ||
|
||
|
||
@pytest.mark.parametrize("parser", ["lxml", "etree"]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this can be a fixture
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -33,6 +33,80 @@ For example: | |||
storage_options=headers | |||
) | |||
|
|||
.. _whatsnew_130.window_method_table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like you picked up another change here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should I merge latest? And should I add XML section to io.rst
or handle in different PR?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should always merge latest every time you are pushing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add docs for io.rst in this PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a top-level table in io.rst that needs updating as well (for the I/O read/write methods near the top)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added XML section and updated top-level table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
great, this still looks like an artfiact from a merge
doc/source/whatsnew/v1.3.0.rst
Outdated
We added I/O support to read and render shallow versions of XML documents with | ||
:func:`pandas.read_xml` and :meth:`DataFrame.to_xml`. Using lxml as parser, | ||
full XPath 1.0 and XSLT 1.0 is available. (:issue:`27554`) | ||
======= |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rebase issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will clean up
just a note here. we can merge this as long as it passes all tests and is mostly complete. meaning can certainly have a followup issue to tackle small things / corrections / xfails. |
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -33,6 +33,80 @@ For example: | |||
storage_options=headers | |||
) | |||
|
|||
.. _whatsnew_130.window_method_table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should always merge latest every time you are pushing
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -33,6 +33,80 @@ For example: | |||
storage_options=headers | |||
) | |||
|
|||
.. _whatsnew_130.window_method_table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add docs for io.rst in this PR
doc/source/whatsnew/v1.3.0.rst
Outdated
@@ -33,6 +33,80 @@ For example: | |||
storage_options=headers | |||
) | |||
|
|||
.. _whatsnew_130.window_method_table: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there is a top-level table in io.rst that needs updating as well (for the I/O read/write methods near the top)
pandas/io/formats/xml.py
Outdated
style_doc = io.StringIO(style_doc) | ||
|
||
handle_data = self._get_data_from_filepath(style_doc) | ||
xml_data = self._preprocess_data(handle_data) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to use _get_data_from_filepath
here too? Maybe extend it to handle string
and bytes
as well. That would avoid creating a StringIO
/BytesIO
that isn't closed (I don't think that unclosed StringIO
/BytesIO
trigger a ResourceWarning
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call. I used StringIO
here to pass mypy (which interestingly does not raise on similar line in pandas.io.xml for read_xml
maybe due to ternary operator). But I can move the XML string to bytes conversion in _get_data_from_filepath
which passes mypy and avoids io objects.
def _get_data_from_filepath(self, filepath_or_buffer):
def _get_data_from_filepath(self, filepath_or_buffer):
"""
Extract raw XML data.
...
"""
filepath_or_buffer = stringify_path(filepath_or_buffer)
if (
isinstance(filepath_or_buffer, str)
and not filepath_or_buffer.startswith(("<?xml", "<"))
) and (
not isinstance(filepath_or_buffer, str)
or is_url(filepath_or_buffer)
or is_fsspec_url(filepath_or_buffer)
or file_exists(filepath_or_buffer)
):
with get_handle(
filepath_or_buffer,
"r",
encoding=self.encoding,
compression=self.compression,
storage_options=self.storage_options,
) as handle_obj:
filepath_or_buffer = (
handle_obj.handle.read()
if hasattr(handle_obj.handle, "read")
else handle_obj.handle
)
return filepath_or_buffer
I need to convert, otherwise, method errs with below traceback. Any thoughts for conversion workaround?
Traceback (most recent call last):
...
File "/home/pandas-parfaitg/pandas/io/xml.py", line 213, in _get_data_from_filepath
with get_handle(
File "/home/pandas-parfaitg/pandas/io/common.py", line 593, in get_handle
ioargs = _get_filepath_or_buffer(
File "/home/pandas-parfaitg/pandas/io/common.py", line 362, in _get_filepath_or_buffer
file_obj = fsspec.open(
File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 429, in open
return open_files(
File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 280, in open_files
fs, fs_token, paths = get_fs_token_paths(
File "/opt/conda/lib/python3.8/site-packages/fsspec/core.py", line 600, in get_fs_token_paths
cls = get_filesystem_class(protocol)
File "/opt/conda/lib/python3.8/site-packages/fsspec/registry.py", line 204, in get_filesystem_class
raise ValueError("Protocol not known: %s" % protocol)
ValueError: Protocol not known: <?xml version='1.0' encoding='utf-8'?>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
based on the traceback it seem you are you passing an XML string/bytes to get_handle
. I think in your case you can use get_handle
for everything except 1) strings representing XML data, 2) any bytes objects, and 3) None
(if that is possible).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See updated code block above, _get_data_from_filepath
had conditionals before passing into get_handle
(omitted earlier for brevity). My addition for your 1 is first if
condition (which does not raise errors, passes tests, and mypy). I even add tests for a not file or buffer object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remaining CI test fails do not relate to this PR with pandas.io.xml but another test, specifically: pandas.tests.arrays.sparse.test_array. Should I keep upstream/merge
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great! The IO side looks good to me!
Is xml_data
always opened by pandas or is it possible that it is a user-provided file object? If it is always opened by pandas you could have something like the following to make sure that the StringIO/BytesIO are always closed:
with self._preprocess_data(handle_data) as xml_data:
....
I think there are a few more places where this could be used (if xml_data
cannot be a user-provided file handle).
Remaining CI test fails do not relate to this PR with pandas.io.xml but another test, specifically: pandas.tests.arrays.sparse.test_array. Should I keep upstream/merge?
Multiple commits on master have the same failing test. I wouldn't worry about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looking great! I don't think you need to add tests for not intended user input (None/DataFrame).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. But for None
, what if user sends a variable to read_xml
that contains NoneType
? Possibly initialized but never given a value later or variable was assigned None
at end of an API process.
pandas/tests/io/test_xml.py
Outdated
@@ -237,6 +236,28 @@ def test_file_buffered_reader_no_xml_declaration(datapath, parser, mode): | |||
tm.assert_frame_equal(df_str, df_expected) | |||
|
|||
|
|||
@td.skip_if_no("lxml") | |||
def test_closed_file_lxml(datapath): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you can probably parametrize these two tests.
In general, I'm not sure whether it is necessary to enforce generic error messages for "obviously" wrong inputs (None/closed files handles). @jreback
One test to add (or extending an existing test) is to make sure that a user-provided file handle is not closed by read/to_xml.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Understood. I can remove those wrong input tests. I tried simulating how users may behave (having answered many StackOverflow pandas answers from newbies!). Will parametrize and add file handle close tests.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a number of comments. overall looks really good.
I think might be ok to centralize all testing in pandas/tests/io/xml/test_to_xml.py and so on (you can move all these tests there)
pandas/io/formats/xml.py
Outdated
including replacing missing entities and including indexes. | ||
""" | ||
|
||
na_dict = {"None": self.na_rep, "NaN": self.na_rep, "nan": self.na_rep} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ParfaitG can you update this
pandas/io/formats/xml.py
Outdated
|
||
raise AbstractMethodError(self) | ||
|
||
def _get_data_from_filepath(self, filepath_or_buffer): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
pandas/io/formats/xml.py
Outdated
|
||
return filepath_or_buffer | ||
|
||
def _preprocess_data(self, data): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will do.
pandas/io/formats/xml.py
Outdated
The data either has a `read` attribute (e.g. a file object or a | ||
StringIO/BytesIO) or is a string or bytes that is an XML document. | ||
""" | ||
if isinstance(data, str): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm why do we accept bytes here? (and not just string)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is this a method on the class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both _get_data_from_filepath
and _preprocess_data
methods were borrowed from pandas.io.json._json inside the JsonReader
class. How to adjust here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
oh, ok yeah just make a module level function if you need to do this then
pandas/io/xml.py
Outdated
children = self.xml_doc.xpath(self.xpath + "/*", namespaces=self.namespaces) | ||
attrs = self.xml_doc.xpath(self.xpath + "/@*", namespaces=self.namespaces) | ||
|
||
if (elems == [] and attrs == [] and children == []) or ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is a very strange condition, can you pull out the attrs & children. can you use not children
and so on here
return tp.read() | ||
except ParserError: | ||
raise ParserError( | ||
"XML document may be too complex for import. " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this hit in tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is on to-do list for tests. A catch-all for edge cases that passes other checks but fails here. There may be a complex XML that I have not anticipated. Otherwise I can let TextParser's TypeError
raise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
kk sure
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO list is fine
pandas/tests/io/test_xml.py
Outdated
with open(xsl, mode) as f: | ||
xsl_obj = f.read() | ||
|
||
read_xml(kml, stylesheet=xsl_obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no comparisons? (but should check something minimal at least)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm. @pandas-dev/pandas-core if anyone would like to review. will merge in a few days otherwise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is really impressive - nice work! Small comments but given how large this PR is I would be OK with merging and tackling as follow ups
pandas/io/formats/xml.py
Outdated
return bytes(new_doc) | ||
|
||
|
||
def _get_data_from_filepath( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks to be one of the only functions not typed; always nice to have (can be done in a follow up)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also is this a copy of the function in the other xml.py
module? Would be nice to de-duplicate
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Re parse_doc
not typed in Lxml classes, originally I had it typed but lxml unlike etree has its _Element
and _ElementTree
objects as private variables (with leading underscores) which fails flake8
on import. Can ignore.
Yes, methods do repeat. Can pull out of class as module level method in pandas.io.import (i.e., read_xml
) to be imported in pandas.io.formats.xml (i.e., to_xml
).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Digging deeper, the private variables were from modified class objects. lxml does use same named types as etree. Adjusted accordingly.
functionality. | ||
""" | ||
|
||
def __init__( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar comment on typing for this method
thanks @ParfaitG really nice. please issue PRs for followups when you can. You may want to move the list into an issue for tracking. |
Thanks @twoertwein for your tremendous help on I/O side! |
… (GH27554)
To Do List
parse_dates
feature forread_xml
.storage_options
.ParserError
,OSError
,URLError
, etc.xpath_vars
feature to pass$
variables inxpath
expression. See lxml xpath() method.xsl_params
feature to pass values into XSLT script. See lxml stylesheet parameters.iterparse
feature for memory efficient parsing of large XML. See etree iterparse and lxml iterparse.